## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
First few lines of the dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Variable types. All 13 variables are numeric; X and quality are stored as integers and the rest as doubles.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Summary statistics of all the variables in the wine data. Even at a glance, some variables (such as total.sulfur.dioxide, which ranges from 6 to 289) are spread far more widely than others (such as chlorides, which ranges from 0.012 to 0.611).
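The three outputs above look like the results of `head()`, `str()` and `summary()`. A minimal sketch of those calls follows, assuming a data frame named `wine` (a hypothetical name) that is rebuilt here from only the six rows printed above, since the source file is not named in this report.

```r
# First six rows of the dataset, copied from the head() output above.
# The full 1599-row data is not reproduced; `wine` is an illustrative name.
wine <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

head(wine)     # first rows
str(wine)      # variable types
summary(wine)  # min, quartiles, mean and max per column

# Range (max - min) of each column, a quick way to compare spreads
spread <- sapply(wine, function(col) diff(range(col)))
spread
```

On six rows the numbers will of course differ from the full-data summary shown above; only the calls are the point.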
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
A histogram of wine quality. Quality takes integer values from 3 to 8, and most wines are rated 5 or 6.
A histogram of alcohol levels. Alcohol is not restricted to integer values; reducing the bin width shows this below.
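As a small worked check, the share of wines rated 5 or 6 can be computed directly from the counts printed above. The vector name `quality_counts` is illustrative.

```r
# Counts per quality score, copied from the table() output above
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

total <- sum(quality_counts)
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / total

total
round(share_5_or_6, 3)

# Base R bar chart of the counts (drawn only in an interactive session)
if (interactive()) barplot(quality_counts, xlab = "quality", ylab = "count")
```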
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
##
## 8.4 8.5 8.7 8.8
## 2 1 2 2
## 9 9.05 9.1 9.2
## 30 1 23 72
## 9.23333333333333 9.25 9.3 9.4
## 1 1 59 103
## 9.5 9.55 9.56666666666667 9.6
## 139 2 1 59
## 9.7 9.8 9.9 9.95
## 54 78 49 1
## 10 10.0333333333333 10.1 10.2
## 67 2 47 46
## 10.3 10.4 10.5 10.55
## 33 41 67 2
## 10.6 10.7 10.75 10.8
## 28 27 1 42
## 10.9 11 11.0666666666667 11.1
## 49 59 1 27
## 11.2 11.3 11.4 11.5
## 36 32 32 30
## 11.6 11.7 11.8 11.9
## 15 23 29 20
## 11.95 12 12.1 12.2
## 1 21 13 12
## 12.3 12.4 12.5 12.6
## 12 13 21 6
## 12.7 12.8 12.9 13
## 9 17 9 6
## 13.1 13.2 13.3 13.4
## 2 1 3 3
## 13.5 13.5666666666667 13.6 14
## 1 1 4 7
## 14.9
## 1
Summary statistics and a frequency table of the distinct alcohol values. The most frequent alcohol levels lie between 9 and 10, and half of the wines are at 10.20 or below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Summary statistics and histogram of residual sugar. 75% of wines have residual sugar of 2.6 or less. The histogram is right skewed with a long tail.
Reducing the bin width of the histogram for residual sugar shows the long tail more clearly.
Below is the histogram of log-transformed residual sugar. The transformation compresses the long tail and makes the distribution easier to read.
Creating a new variable (alcohol + residual.sugar). The histogram below shows a long tail inherited from residual sugar.
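A minimal sketch of the log transformation, using only the first ten residual sugar values from the `str()` output earlier. The base of the logarithm is not stated in this report, so `log10` is an assumption, and base R `hist()` stands in for whatever plotting call was actually used.

```r
# First ten residual.sugar values, copied from the str() output
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Assumption: base-10 logarithm
log_sugar <- log10(residual_sugar)

# Distance of the largest value from the median, before and after
tail_raw <- max(residual_sugar) - median(residual_sugar)
tail_log <- max(log_sugar) - median(log_sugar)

# Bin counts without drawing; drop plot = FALSE to see the histograms
hist(residual_sugar, plot = FALSE)$counts
hist(log_sugar, plot = FALSE)$counts
```

The single large value (6.1) sits far from the median on the raw scale and much closer on the log scale, which is the effect the transformed histogram relies on.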
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
A histogram of pH levels and summary statistics. pH is roughly normally distributed, with the middle half (25th to 75th percentile) between 3.21 and 3.40.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
A histogram of sulphates. This is also right skewed with a long tail. Summary statistics show that 75% of values are between 0.33 and 0.73.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The histogram of density seems normally distributed with the middle half ranging from 0.9956 to 0.9978.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The histogram of total sulfur dioxide is right skewed. Summary statistics show that the lower half of the wines are between 6 and 38. The max is 289.
The histogram of citric acid shows an irregular distribution with no recognizable parametric shape. Reducing the bin width shows this better. There is also a large number of 0’s.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Summary statistics of citric acid.
## Mode FALSE TRUE NA's
## logical 1467 132 0
Logical summary of whether citric acid is equal to 0 (TRUE for 132 wines).
## Mode FALSE TRUE NA's
## logical 132 1467 0
Logical summary of whether citric acid is NOT equal to 0 (TRUE for 1467 wines).
About 8% of the wines (132 of 1599) do not contain any citric acid. Only one wine has the max value of 1.
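The logical summaries above can be sketched as follows, using the first ten citric acid values from the `str()` output as a stand-in for the full column. The vector names are illustrative.

```r
# First ten citric.acid values, copied from the str() output
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

is_zero <- citric_acid == 0
summary(is_zero)   # counts of FALSE / TRUE, as in the output above
mean(is_zero)      # share of zeros in this ten-value sample

# Share of zeros in the full data, from the counts reported above
share_zero_full <- 132 / 1599
round(share_zero_full, 3)
```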
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Histogram and summary statistics of volatile acidity. It’s right skewed with 75% of wines having 0.64 or less.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Histogram and summary statistics of fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Histogram and summary statistics of chlorides. This is right skewed with a long tail. The middle half are between 0.07 and 0.09. The log-transformed chlorides look more normal below.
There are 1599 red variants of the Portuguese “Vinho Verde” wine in the dataset, with 12 features (the 13th column, X, is just a unique ID). All of the features are numerical, and all are continuous except quality, which is discrete (integers between 3 and 8).
Quality is the output variable and the rest of the 11 features are chemical properties of each wine. I’m interested in finding out which chemical properties influence the quality of red wines. I suspect that combinations of these features would affect the quality more so than individual features.
From my initial univariate analysis alone, alcohol, residual sugar and sulfur dioxide are interesting in that their distributions are right skewed, and citric acid is interesting because its distribution is irregular.
The correlation matrix shows that alcohol has the highest positive correlation with quality and volatile acidity has the highest negative correlation with quality. Higher correlations between features seem to arise from chemical properties (e.g., higher acidity goes with lower pH).
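A correlation matrix of this kind can be sketched with base R `cor()`. The example uses a handful of columns from the six rows printed at the top of the report, so the coefficients are illustrative only and will not match the full-data matrix; the name `wine` is hypothetical.

```r
# Six rows and six columns copied from the head() output at the top
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlations between every pair of columns
cor_matrix <- cor(wine)

# Correlation of each variable with quality, strongest positive first
quality_cor <- sort(cor_matrix[, "quality"], decreasing = TRUE)
round(quality_cor, 2)

# Acidity against pH
cor_matrix["fixed.acidity", "pH"]
```

Even in this tiny sample, fixed acidity and pH move in opposite directions, consistent with the chemistry noted above.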
Because plotting quality directly suffers from overplotting, the above scatterplot adds jitter with an alpha of 1/2. We can see that most wines have an alcohol level below 12 and a quality between 5 and 7. While there is an overall upward trend between alcohol and quality, quality varies widely at lower levels of alcohol.
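The jitter-plus-transparency idea can be sketched in base R. The report mentions `geom_jitter` elsewhere, so the original plot was presumably drawn with ggplot2; base R `jitter()` and `adjustcolor()` are used here only to keep the sketch dependency-free, and the jitter amounts are arbitrary choices.

```r
set.seed(1)  # make the random jitter reproducible

# Six alcohol/quality pairs copied from the head() output at the top
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
quality <- c(5, 5, 5, 6, 5, 5)

# Add small uniform noise so identical points no longer overlap
quality_jit <- jitter(quality, amount = 0.2)
alcohol_jit <- jitter(alcohol, amount = 0.05)

# Semi-transparent points (alpha = 1/2); drawn only interactively
if (interactive()) {
  plot(alcohol_jit, quality_jit, pch = 16,
       col = adjustcolor("black", alpha.f = 0.5),
       xlab = "alcohol", ylab = "quality")
}
```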
##
## Call:
## lm(formula = quality ~ alcohol, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
Running a linear regression of quality on alcohol shows that alcohol explains about 22.7% of the variation in quality (R^2 = 0.2267).
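To make the fitted line concrete, the coefficients reported above can be plugged in by hand. This is a worked illustration rather than a refit; the helper name `predict_quality` is hypothetical.

```r
# Coefficients copied from the lm(quality ~ alcohol) output above
intercept <- 1.87497
slope <- 0.36084

# Predicted quality for a given alcohol level
predict_quality <- function(alcohol) intercept + slope * alcohol

# Predictions at the minimum, median and maximum alcohol levels
predict_quality(c(8.4, 10.2, 14.9))
```

Each additional unit of alcohol raises the predicted quality by about 0.36 points under this model.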
## (0,9.5] (9.5,10.2] (10.2,11.1] (11.1,14.9]
## 436 406 377 380
The summary of the alcohol groups shows that wines are spread fairly evenly across the four groups. The boxplots of alcohol levels show that the highest quality (8) generally has the highest alcohol. Alcohol levels also seem to increase with quality overall, although there is a dip at quality level 5. Quality level 5 also has a few outliers.
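The group labels above are what base R `cut()` produces when the breaks are 0 and the alcohol quartiles. The sketch below assumes those breaks (inferred from the labels and the summary statistics) and applies them to the first ten alcohol values from the `str()` output.

```r
# First ten alcohol values, copied from the str() output
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Assumed breaks: 0, then the 1st quartile, median, 3rd quartile and max
breaks <- c(0, 9.5, 10.2, 11.1, 14.9)

# Intervals are right-closed by default, so 9.5 falls in (0,9.5]
alcohol_group <- cut(alcohol, breaks = breaks)

levels(alcohol_group)
table(alcohol_group)
```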
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 653 653 15.9 0.36 0.65 7.5
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 653 0.096 22 71 0.9976 2.98
## sulphates alcohol quality alcohol.group
## 653 0.84 14.9 5 (11.1,14.9]
Looking at the largest outlier at quality == 5, it turns out to be the wine with the maximum alcohol level of 14.9 (X = 653).
##
## Call:
## lm(formula = quality ~ sulphates, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2432 -0.5424 0.1102 0.4456 2.3977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.84775 0.07842 61.82 <2e-16 ***
## sulphates 1.19771 0.11539 10.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7819 on 1597 degrees of freedom
## Multiple R-squared: 0.0632, Adjusted R-squared: 0.06261
## F-statistic: 107.7 on 1 and 1597 DF, p-value: < 2.2e-16
We can see that sulphates over 1.0 are more dispersed. Filtering the data to sulphates of 1.0 or below therefore shows a clearer positive correlation with quality. While sulphates over the full range explain 6% of the variation in quality, sulphates equal to or less than 1.0 explain 15%.
##
## Call:
## lm(formula = quality ~ sulphates, data = subset(data, sulphates <=
## 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2217 -0.4898 -0.1171 0.4840 2.3795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9739 0.1008 39.42 <2e-16 ***
## sulphates 2.6136 0.1555 16.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7402 on 1539 degrees of freedom
## Multiple R-squared: 0.1551, Adjusted R-squared: 0.1545
## F-statistic: 282.4 on 1 and 1539 DF, p-value: < 2.2e-16
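The comparison between the full and the filtered regression can be sketched with `subset()` and `summary()$r.squared`. The data frame `toy` below is invented purely for illustration (it is not the wine data): it has a clear upward trend up to 1.0 and two high-sulphate points that break it.

```r
# Synthetic illustration only: NOT values from the wine dataset
toy <- data.frame(
  sulphates = c(0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0),
  quality   = c(4,   5,   5,   6,   6,   7,   7,   5,   5)
)

# Fit on all rows, then on rows with sulphates of 1.0 or below
fit_all     <- lm(quality ~ sulphates, data = toy)
fit_trimmed <- lm(quality ~ sulphates, data = subset(toy, sulphates <= 1))

r2_all     <- summary(fit_all)$r.squared
r2_trimmed <- summary(fit_trimmed)$r.squared

c(all = r2_all, trimmed = r2_trimmed)
```

Filtering on the predictor changes which wines the model describes, so the higher R^2 applies only to wines within the retained range.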
There seems to be an outlier with a volatile acidity level of about 1.6 (the maximum is 1.58). Removing volatile acidity levels over 1.2 shows a stronger trend.
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = subset(data,
## volatile.acidity <= 1.2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79138 -0.54227 -0.00846 0.47108 2.93816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.57430 0.05924 110.98 <2e-16 ***
## volatile.acidity -1.77937 0.10697 -16.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1593 degrees of freedom
## Multiple R-squared: 0.148, Adjusted R-squared: 0.1474
## F-statistic: 276.7 on 1 and 1593 DF, p-value: < 2.2e-16
I added jitter to all plots that include quality to avoid overplotting. Aside from one measure of acidity being correlated with another, I also noticed that measures of acidity and sugar are correlated with density.
The strongest correlation with quality is alcohol, with a linear regression explaining about 23% of the variation in quality. Citric acid and sulphates are the other variables with a stronger positive correlation with quality. I also noticed that 82% of the wines are rated 5 or 6. Jittering the plots to avoid overplotting revealed that, while there is an overall upward trend with these variables, the concentration of quality values at 5 and 6 limits their ability to explain variation in quality.
In some cases, such as sulphates and volatile acidity, however, removing outliers helped increase R^2.
Other things to keep in mind include correlations between different chemical properties (different measures of acidity, density vs. alcohol and sugar). These will have to be controlled for in estimating quality.
Quality of wine is positively correlated with alcohol and sulphates. Volatile acidity is negatively correlated with quality. Citric acid is also more positively correlated than the others, but less so than alcohol and sulphates.
The histogram shows that there tend to be more wines with higher quality. The boxplot above shows that quality/alcohol is higher among higher sulphate levels. There are also more outliers in the lowest sulphate level.
This boxplot shows that higher citric acid groups have higher quality/alcohol levels, although the increase is not at a constant rate. Citric acid levels above the median are associated with higher quality/alcohol than citric acid levels below the median.
The scatterplots above show the relationship between sulphates and quality/alcohol in the different quartile groups of citric acid. All groups of citric acid have a positive linear slope.
## [1] "[0,0.09]" "(0.09,0.26]" "(0.26,0.42]" "(0.42,1]"
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$citric.acid.group == "[0,0.09]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.4106 0.1864
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$citric.acid.group == "(0.09,0.26]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.51836 0.03681
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$citric.acid.group == "(0.26,0.42]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.52662 0.03693
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$citric.acid.group == "(0.42,1]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.49813 0.07759
Running linear regressions on each quartile group of citric acid shows that the 1st quartile of citric acid has the highest coefficient of sulphates in estimating quality/alcohol.
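Fitting the same model once per group can be sketched with `split()` and `sapply()`. The data frame `toy`, its `group` column and the helper `slope_by_group` are all hypothetical; the toy values are constructed so the slopes are known in advance. The last lines compare the four slopes actually reported above.

```r
# Synthetic illustration only: two groups with known slopes (0.2 and 0.05)
toy <- data.frame(
  group     = rep(c("A", "B"), each = 3),
  sulphates = rep(c(0.4, 0.6, 0.8), times = 2),
  alcohol   = 10,
  quality   = c(4.8, 5.2, 5.6, 5.2, 5.3, 5.4)
)

# Fit quality/alcohol ~ sulphates within each group, return the slopes
slope_by_group <- function(df) {
  sapply(split(df, df$group), function(part) {
    coef(lm(quality / alcohol ~ sulphates, data = part))[["sulphates"]]
  })
}

slopes <- slope_by_group(toy)
slopes

# Slopes reported above for the four citric acid groups
reported <- c("[0,0.09]" = 0.1864, "(0.09,0.26]" = 0.03681,
              "(0.26,0.42]" = 0.03693, "(0.42,1]" = 0.07759)
names(which.max(reported))
```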
The boxplot shows that while median quality/alcohol values go down as volatile acidity increases, there are a lot of outliers in the highest volatile acidity group. Also, the maximum value of the 3rd quartile group is larger than the maximum value of the 2nd quartile group.
This scatterplot has geom_jitter added to avoid overplotting. The linear smoothing line indicates an overall positive relationship between quality and sulphates. Also, the darker (lower) alcohol levels seem more concentrated at quality level 5.
To examine this further, the histogram below shows the relationship between quality and sulphates where alcohol is less than the median alcohol level (10.20). It shows that quality = 5 has the highest concentration of sulphates at this alcohol level.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The facet grid above shows the relationship between quality/alcohol and sulphates for the different groups of volatile.acidity. At all levels of volatile.acidity, there is a positive linear trend between sulphates and quality/alcohol. I also noticed outliers in the 2nd quartile of the volatile.acidity group. Removing these outliers shows that the 1st quartile of the volatile.acidity group has the highest linear slope (0.1225) between sulphates and quality/alcohol.
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$volatile.acidity.group == "(0,0.39]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.4764 0.1225
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(VA2Q,
## VA2Q$sulphates < 1.5))
##
## Coefficients:
## (Intercept) sulphates
## 0.51403 0.04638
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$volatile.acidity.group == "(0.52,0.64]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.51547 0.04134
##
## Call:
## lm(formula = quality/alcohol ~ sulphates, data = subset(data,
## data$volatile.acidity.group == "(0.64,1.58]"))
##
## Coefficients:
## (Intercept) sulphates
## 0.46137 0.09373
The scatterplot below shows the overall relationship between sulphates and quality/alcohol.
The histogram shows the smoothing line after removing the outliers that are 1.5 and higher in sulphates.
The following scatterplots tweak and create new variables to better model the relationship with quality/alcohol.
Multiplying sulphates by 10 helps reduce the spread between values. Linear models built this way show that using these four variables (alcohol, sulphates, citric acid, volatile acidity) improves R^2 by over 12 percentage points compared with alcohol alone (from 0.227 to 0.349).
##
## Call:
## lm(formula = quality ~ alcohol, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + sulphates * citric.acid + volatile.acidity,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71232 -0.38772 -0.06075 0.46631 2.26531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.33146 0.22553 10.337 < 2e-16 ***
## alcohol 0.30628 0.01579 19.393 < 2e-16 ***
## sulphates 1.18470 0.19066 6.214 6.6e-10 ***
## citric.acid 0.85977 0.32508 2.645 0.00825 **
## volatile.acidity -1.21410 0.11361 -10.687 < 2e-16 ***
## sulphates:citric.acid -1.37600 0.45161 -3.047 0.00235 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6571 on 1593 degrees of freedom
## Multiple R-squared: 0.34, Adjusted R-squared: 0.3379
## F-statistic: 164.1 on 5 and 1593 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol * I(sulphates * 10) + citric.acid +
## volatile.acidity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.76963 -0.38271 -0.08959 0.47365 2.24285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.46949 0.71146 9.093 < 2e-16 ***
## alcohol -0.07310 0.07004 -1.044 0.297
## I(sulphates * 10) -0.50221 0.10264 -4.893 1.10e-06 ***
## citric.acid -0.07938 0.10284 -0.772 0.440
## volatile.acidity -1.22336 0.11186 -10.937 < 2e-16 ***
## alcohol:I(sulphates * 10) 0.05673 0.01013 5.598 2.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6526 on 1593 degrees of freedom
## Multiple R-squared: 0.3489, Adjusted R-squared: 0.3469
## F-statistic: 170.8 on 5 and 1593 DF, p-value: < 2.2e-16
The last linear model explains 34.9% of the variance in the quality of wines.
Because alcohol was the best single indicator of variance in wine quality, I used the other variables to explain quality/alcohol as a dependent variable. Relationships between other variables and quality were largely consistent with their relationships with quality/alcohol.
In reducing the wide variance between variables and quality/alcohol, multiplying sulphates by 10 had the best results. For other variables, multiplying the values only widened the variance.
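As a worked illustration of the last model, the reported coefficients can be applied by hand to the first wine in the dataset (alcohol 9.4, sulphates 0.56, citric acid 0, volatile acidity 0.70, observed quality 5). The names `coefs` and `predict_quality` are illustrative; this is arithmetic on the printed estimates, not a refit.

```r
# Estimates copied from the final lm() output above
coefs <- c(
  intercept             = 6.46949,
  alcohol               = -0.07310,
  sulphates10           = -0.50221,  # coefficient on I(sulphates * 10)
  citric.acid           = -0.07938,
  volatile.acidity      = -1.22336,
  alcohol_x_sulphates10 = 0.05673    # interaction term
)

predict_quality <- function(alcohol, sulphates, citric.acid, volatile.acidity) {
  s10 <- sulphates * 10
  coefs[["intercept"]] +
    coefs[["alcohol"]] * alcohol +
    coefs[["sulphates10"]] * s10 +
    coefs[["citric.acid"]] * citric.acid +
    coefs[["volatile.acidity"]] * volatile.acidity +
    coefs[["alcohol_x_sulphates10"]] * alcohol * s10
}

# First wine in the dataset (row 1 of the head() output)
pred_first <- predict_quality(9.4, 0.56, 0, 0.70)
round(pred_first, 2)

# Gain in R^2 over the alcohol-only model, in absolute terms
r2_gain <- 0.3489 - 0.2267
r2_gain
```

Because of the interaction term, the effect of alcohol on predicted quality depends on the sulphate level, so the individual alcohol and sulphates coefficients should not be read in isolation.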
Using the quality/alcohol metric helps avoid overplotting and the difficulty of visualizing integer quality values. The boxplot clearly shows that higher sulphate groups have higher minimum, median and maximum quality/alcohol values. The four sulphate groups represent the 1st, 2nd, 3rd and 4th quartiles.
This plot supports the other findings of a positive relationship between sulphates and quality. Together with those findings, it motivates adding sulphates to the linear regression model that predicts the quality of wines.
The above scatterplot shows sulphates on the x-axis and quality/alcohol on the y-axis. Each facet represents one of the 1st through 4th quartiles of citric acid. The linear regressions run previously showed that the 1st quartile of citric acid has the highest slope (0.1864).
The facet grid helps observe and compare scatterplots, and this one supports the previous finding that sulphates and quality/alcohol are positively correlated at all levels of citric acid, which is itself one of the variables more strongly correlated with quality. Together with the other findings, this motivates adding citric acid to my final linear regression model to predict quality.
Using quality/alcohol as the dependent variable, this scatterplot reduced the variance by using the interaction between citric acid and sulphates. It shows a clear trend: as the interaction between citric acid and sulphates goes up, quality/alcohol goes up.
This scatterplot was derived by creating different variables from citric acid and sulphates to explain quality/alcohol. I used this in my final linear regression model.
Possibly because the wine quality ratings are based on individuals’ subjective tastes, 82% of the wines were rated either 5 or 6. Because of this, and because the quality variable takes only integer values between 3 and 8, visualizing the data and avoiding overplotting were difficult. To work around this problem, I used quality/alcohol as the dependent variable instead. To investigate why there were so many 5 and 6 quality ratings, it would be helpful to get data on other types of wines.
There were also correlations between variables arising from chemical properties (between different acidity measures and between density and alcohol). I tried to account for these to avoid correlations between independent variables.